Partly Supervised Uighur Morpheme Segmentation
Authors
Abstract
This paper introduces Uighur morpheme segmentation, a basic part of the comprehensive effort to compile a Uighur language corpus, conducted at Xinjiang University in cooperation with Kyoto University. Uighur is an agglutinative language whose words are formed by productive affixation of derivational and inflectional suffixes to stems. Derivational suffixes change the meaning of the stem, while inflectional suffixes mark its grammatical functions, such as case. The surface realization of words is also constrained by phonetic rules such as phonetic harmony and vowel weakening, but the surface form of the stem is basically unchanged except for its last vowel. For example, the words “adam+lar, adam+ni, adam+ga, adam+ning, adam+dak” are formed by attaching the suffixes “lar, ni, ga, ning, dak” to the stem “adam” (meaning “person”). There are also complex, or compound, suffixes. These give rise to a huge number of combinations, so morpheme segmentation is a vital part of Uighur language analysis. We compiled lists of 38500 stems and 325 single (non-compound) suffixes to cover most general words. Then, a list of compound suffixes was collected in an unsupervised manner from our corpus of 200K words by matching against the basic list. After manual checking, 5880 compound suffixes were obtained. For automatic morpheme segmentation, we apply a forward and backward matching algorithm based on these lists. One of the biggest problems is vowel weakening: the last vowel of the stem, “a” or “ä”, is often replaced by another vowel, “i” or “e”. The phenomenon is observed for 12% of the words in our corpus. Thus, we have devised substitution rules, but these cause ambiguity in the morpheme segmentation. When more than one segmentation hypothesis is generated, the hypothesis with the longer stem is preferred; this is a safe heuristic. (The work is funded by the Natural Science Fund of China, No. 60662002.)
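The list-based matching described above can be illustrated with a minimal sketch. It is not the authors' implementation: forward and backward matching is approximated by trying every split point, checking the left part against a stem list and the right part against a suffix list. The stem "adam" and the five suffixes come from the abstract; the extra stem "bala", the vowel inventory, and all function names are assumptions made for illustration.

```python
# Toy lexicon: "adam" and the five suffixes come from the abstract; "bala" is
# an illustrative extra entry added for this sketch (an assumption).
STEMS = {"adam", "bala"}
SUFFIXES = {"lar", "ni", "ga", "ning", "dak"}

VOWELS = set("aeiouöüä")  # assumed Latin-script vowel inventory
# Substitution rule: a surface "i" or "e" may stand for an original "a" or "ä".
WEAKENED_TO_ORIGINAL = {"i": ("a", "ä"), "e": ("a", "ä")}


def restore_candidates(surface_stem):
    """Yield the surface stem, then variants with its last vowel un-weakened."""
    yield surface_stem
    for pos in range(len(surface_stem) - 1, -1, -1):
        ch = surface_stem[pos]
        if ch in VOWELS:
            for original in WEAKENED_TO_ORIGINAL.get(ch, ()):
                yield surface_stem[:pos] + original + surface_stem[pos + 1:]
            break  # only the last vowel of the stem is considered


def hypotheses(word, stems=STEMS, suffixes=SUFFIXES):
    """Return every (stem, suffix) split licensed by the two lists."""
    found = []
    for cut in range(1, len(word) + 1):
        surface_stem, suffix = word[:cut], word[cut:]
        if suffix and suffix not in suffixes:
            continue  # the remainder must be empty or a listed suffix
        for stem in restore_candidates(surface_stem):
            if stem in stems and (stem, suffix) not in found:
                found.append((stem, suffix))
    return found


def segment(word, stems=STEMS, suffixes=SUFFIXES):
    """Pick the hypothesis with the longest stem, or None if nothing matches."""
    candidates = hypotheses(word, stems, suffixes)
    return max(candidates, key=lambda h: len(h[0])) if candidates else None


print(segment("adamlar"))  # ('adam', 'lar')
print(segment("balilar"))  # ('bala', 'lar'), last stem vowel restored
```

In a fuller system the suffix set would also hold the compound suffixes, so that a single lookup covers chains of suffixes; this sketch makes no attempt to split a compound suffix into its parts.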
Phonetic harmony is also a key factor that controls stem-suffix connection and syllable concatenation. Thus, we have also introduced phonetic harmony rules, which constrain the connection of stems and suffixes in terms of smooth articulation. For example, certain voiced consonants at the end of a stem must be followed by a suffix starting with a voiced consonant. This constraint effectively reduces the ambiguity. The method was evaluated on 18400 words chosen from our corpus: the accuracy of stem-suffix boundary detection is 96%, and the accuracy of complete stem/suffix segmentation is 92%. The result is encouraging, since the stems of some words, such as new words imported from English, are not included in the stem list. We are investigating an automated method based on a statistical model to cope with them.
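The voicing constraint mentioned above can be sketched as a filter over candidate (stem, suffix) pairs. This is only an illustration under stated assumptions: the abstract does not list which stem-final consonants are constrained, so the sets `CONSTRAINED_FINALS` and `VOICED` below are hypothetical, as are the function names and the candidate suffix "qa". Suffixes that begin with a vowel are assumed to be unconstrained.

```python
VOWELS = set("aeiouöüä")          # assumed Latin-script vowel inventory
VOICED = set("bdgjlmnrwyz")       # assumed set of voiced consonants
CONSTRAINED_FINALS = set("lmnr")  # hypothetical stem-final consonants that
                                  # require a voiced suffix-initial consonant


def harmony_ok(stem, suffix):
    """Return True if the stem-suffix junction passes the voicing check."""
    if not stem or not suffix:
        return True  # nothing to check for a bare stem
    if stem[-1] not in CONSTRAINED_FINALS:
        return True  # the rule applies only to the constrained finals
    first = suffix[0]
    if first in VOWELS:
        return True  # assumption: vowel-initial suffixes are unconstrained
    return first in VOICED


def filter_hypotheses(hypotheses):
    """Drop segmentation hypotheses whose junction violates the check."""
    return [(stem, suffix) for stem, suffix in hypotheses
            if harmony_ok(stem, suffix)]


candidates = [("adam", "ga"), ("adam", "qa")]
print(filter_hypotheses(candidates))  # [('adam', 'ga')]
```

Applying such a filter before the longest-stem preference removes hypotheses that the word lists alone would accept, which is how the constraint reduces ambiguity.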
Similar resources
A semi-supervised learning approach for morpheme segmentation for an Arabic dialect
We present a semi-supervised learning approach which utilizes a heuristic model for learning morpheme segmentation for Arabic dialects. We evaluate our approach by applying morpheme segmentation to the training data of a statistical machine translation (SMT) system. Experiments show that our approach is less sensitive to the availability of annotated stems than a previous rule-based approach an...
Morpheme Segmentation and Concatenation Approaches for Uyghur LVCSR
In this paper, various kinds of sub-word lexica are thoroughly investigated within the framework of a Uyghur LVCSR system. Experimental results show that it is inefficient to model directly on word units or on small units such as morphemes or even syllables. An optimal sub-word unit set between word and morpheme units is observed to fit the ASR system better. In order to select best u...
Semi-supervised Learning for Mongolian Morphological Segmentation
Unlike previous Mongolian morphological segmentation methods based on large labeled training data or complicated rules concluded by linguists, we explore a novel semi-supervised method for a practical application, i.e., statistical machine translation (SMT), based on a low-resource learning setting, in which a small amount of labeled data and large amount of unlabeled data are available. First,...
Weakly Supervised Morphology Learning for Agglutinating Languages Using Small Training Sets
The paper describes a weakly supervised approach for decomposing words into all morphemes: stems, prefixes and suffixes, using wordforms with marked stems as training data. As we concentrate on under-resourced languages, the amount of training data is limited and we need some amount of supervision in the form of a small number of wordforms with marked stems. In the first stage we introduce a ne...
Rule-based Person Name Recognition for Xinjiang Minority Languages
Named entity recognition for the languages of Xinjiang's multiple nationalities is an important part of multilingual processing. In this paper, we analyze the patterns of Uighur and Kazak person names and perform name recognition using a rule-based approach. We also propose and implement rules for Uighur and Kazak word segmentation.